Attention Is All You Need
Paper available at: https://arxiv.org/abs/1706.03762
Recap:
- Traditionally, NLP tasks were approached using RNNs
- Information flows through hidden states, which have to retain everything seen so far
- The decoder produces the next word by predicting it from the hidden state
- Inefficient because of the number of sequential steps and computations (see the sketch after this list)
- RNNs have a hard time learning long-term dependencies
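
The sequential bottleneck is easy to see in code. Below is a minimal sketch (plain NumPy; the toy dimensions and weights are my own illustrative choices) of a vanilla RNN encoder: every token needs one step that depends on the previous hidden state, so the loop cannot be parallelised and gradients have to flow back through all of it.

```python
import numpy as np

# Toy dimensions, chosen only for illustration
seq_len, d_in, d_hid = 10, 8, 16
rng = np.random.default_rng(0)

W_xh = rng.normal(size=(d_in, d_hid)) * 0.1   # input -> hidden
W_hh = rng.normal(size=(d_hid, d_hid)) * 0.1  # hidden -> hidden
x = rng.normal(size=(seq_len, d_in))          # one input sequence

h = np.zeros(d_hid)
for t in range(seq_len):
    # Each step depends on the previous hidden state: the sequence is
    # processed strictly one token at a time, and all information about
    # the past must survive inside `h`.
    h = np.tanh(x[t] @ W_xh + h @ W_hh)

print(h.shape)  # (16,) - the final hidden state summarising the whole sequence
```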
Attention:
- The decoder can decide to attend to the encoder's hidden states from earlier positions
- There is no need to push everything through the whole chain of hidden states
- Produces Keys (K) and Queries (Q)
- Keys index the hidden states via a softmax over the Query-Key scores (a minimal sketch follows this list)
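
The paper's scaled dot-product attention makes the Key/Query/Value roles concrete. A minimal NumPy sketch; the shapes and the softmax helper are my own choices, the formula softmax(QK^T / sqrt(d_k)) V is Eq. (1) of the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Eq. 1 in the paper)."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # how well each Query matches each Key
    weights = softmax(scores, axis=-1)              # soft "index" over the Values
    return weights @ V, weights

# Toy example: 3 queries attending over 5 key/value pairs of width 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)  # (3, 4) (3, 5); each row of w sums to 1
```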
Attention is a major paradigm shift:
- The source sentence is fed into the encoder inputs
- The target sentence is fed in as the outputs (shifted right) for the decoder
- The output is a probability distribution over the next word
- In contrast to an RNN, each token is produced in a single forward step; there is no backpropagation through many time steps (see the layout sketch below)
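
To make the input/output contract concrete, here is a minimal sketch of how one training example is laid out; the token IDs and the BOS/EOS conventions below are illustrative assumptions, not taken from the paper.

```python
# One training example, laid out as the Transformer consumes it.
BOS, EOS = 1, 2  # assumed special-token IDs

src_tokens = [5, 17, 23, 9, EOS]        # source sentence -> encoder input
tgt_tokens = [11, 42, 7, EOS]           # target sentence

decoder_input = [BOS] + tgt_tokens[:-1] # target shifted right -> decoder input
labels        = tgt_tokens              # the model outputs a probability
                                        # distribution over the next token
                                        # at every position at once

for inp, lab in zip(decoder_input, labels):
    print(f"decoder sees {inp:3d} -> should predict {lab:3d}")
# All positions are predicted in one forward pass (with a causal mask),
# so there is no backpropagation through a chain of time steps.
```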
Multi-Head Attention:
- Use attention over the input sequence (sentence)
Combining the source sequence with the target sequence via multi-head (encoder-decoder) attention:
- The encoder of the source sentence discovers interesting things & builds Key-Value pairs
- The decoder, processing the target sentence, builds the Queries
- The Values of the source sentence are indexed using the Keys
- The Query expresses what information the network is looking for (a sketch follows this list)
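
A hedged NumPy sketch of this encoder-decoder multi-head attention: Queries come from the target side, Keys and Values from the source side. The head count, widths, and weight initialisation are toy choices of mine, not the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x_q, x_kv, W_q, W_k, W_v, W_o, n_heads):
    """Queries from x_q (target side), Keys/Values from x_kv (source side);
    each head works on a slice of width d_model // n_heads."""
    d_model = x_q.shape[-1]
    d_head = d_model // n_heads

    def split_heads(x):
        # (seq, d_model) -> (n_heads, seq, d_head)
        return x.reshape(x.shape[0], n_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(x_q @ W_q)
    K = split_heads(x_kv @ W_k)
    V = split_heads(x_kv @ W_v)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, tgt_len, src_len)
    weights = softmax(scores, axis=-1)                    # one soft index per head
    heads = weights @ V                                   # (heads, tgt_len, d_head)

    concat = heads.transpose(1, 0, 2).reshape(x_q.shape[0], d_model)
    return concat @ W_o                                   # final linear projection

# Toy shapes: source of length 6, target of length 4, d_model=8, 2 heads
rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
src = rng.normal(size=(6, d_model))   # encoder output (Keys/Values come from here)
tgt = rng.normal(size=(4, d_model))   # decoder states (Queries come from here)
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(tgt, src, W_q, W_k, W_v, W_o, n_heads).shape)  # (4, 8)
```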
Advantages:
- Reduction in the number of sequential computation steps
- Shorter paths between distant positions (shorter maximum path length)
- Implications for other machine learning fields (computer vision etc.)
Look at: